Abstract

Convolutional Neural Networks (CNN) are state-of-the-art models for many
image classification tasks. However, to recognize cancer subtypes
automatically, training a CNN on gigapixel resolution Whole Slide Tissue
Images (WSI) is currently computationally impossible. The differentiation
of cancer subtypes is based on cellular-level visual features observed on
image patch scale. Therefore, we argue that in this situation, training a
patch-level classifier on image patches will perform better than or similar
to an image-level classifier. The challenge becomes how to intelligently
combine patch-level classification results and model the fact that not all
patches will be discriminative. We propose to train a decision fusion model
to aggregate patch-level predictions given by patch-level CNNs, which to
the best of our knowledge has not been shown before. Furthermore, we
formulate a novel Expectation-Maximization (EM) based method that
automatically locates discriminative patches robustly by utilizing the
spatial relationships of patches. We apply our method to the classification
of glioma and non-small-cell lung carcinoma cases into subtypes. The
classification accuracy of our method is similar to the inter-observer
agreement between pathologists. Although it is impossible to train CNNs on
WSIs, we experimentally demonstrate using a comparable non-cancer dataset
of smaller images that a patch-based CNN can outperform an image-based CNN.

1. Introduction 

Convolutional Neural Networks (CNNs) are currently the state-of-the-art
image classifiers [29]. However, due to high computational cost, CNNs
cannot be applied to very high resolution images, such as gigapixel Whole
Slide Tissue Images (WSI). Classification of cancer WSIs into grades and
subtypes is critical to the study of disease onset and progression and the
development of targeted therapies, because the effects of cancer can be
observed in WSIs at the cellular and sub-cellular levels (Fig. 1). Applying
a CNN directly for WSI classification has several drawbacks. First,
extensive image downsampling is required, by which most of the
discriminative details could be lost. Second, it is possible that a CNN
might only learn from one of the multiple discriminative patterns in an
image, resulting in data inefficiency. Discriminative information is
encoded in high resolution image patches. Therefore, one solution is to
train a CNN on high resolution image patches and predict the label of a WSI
based on patch-level predictions.

The ground truth labels of individual patches are unknown, as only the
image-level ground truth label is given. This complicates the
classification problem. Because tumors may have a mixture of structures and
texture properties, patch-level labels are not necessarily consistent with
the image-level label. More importantly, when aggregating patch-level
labels to an image-level label, simple decision fusion methods such as
voting and max-pooling are not robust and do not match the decision process
followed by pathologists. For example, a mixed subtype of cancer, such as
oligoastrocytoma, might have distinct regions of other cancer subtypes.
Therefore, neither voting nor max-pooling could predict the correct
WSI-level label since the patch-level predictions do not match the
WSI-level label.

We propose using a patch-level CNN and training a decision fusion model as
a two-level model, shown in Fig. 2. The first-level (patch-level) model is
an Expectation Maximization (EM) based method combined with CNN that
outputs patch-level predictions. In particular, we assume that there is a
hidden variable associated with each patch extracted from an image that
indicates whether the patch is discriminative or not. Here, “a
discriminative patch” means that the true hidden label of the patch is the
same as the true label of the image. Initially, we consider all patches to
be discriminative. We train a CNN model that outputs the cancer type
probability of each input patch. We apply spatial smoothing to the
resulting probability map and select only patches with higher probability
values as discriminative patches. We iterate this process using the new set
of discriminative patches in an EM fashion until convergence. In the
second-level (image-level) model, histograms of patch-level predictions are
input into an image-level multiclass logistic regression or Support Vector
Machine (SVM) model that predicts the image-level labels.

Figure 1: A gigapixel Whole Slide Tissue Image of a grade IV tumor (best
viewed in color). Visual features that determine the subtype and grade of a
WSI are visible in high resolution details. In this case, patches framed in
red are discriminative since they show typical visual features of grade IV
tumor. Patches framed in blue are non-discriminative because they only
contain visual features from lower grade tumors. Notice that discriminative
patches are dispersed throughout the image at multiple locations.

Pathology image classification and segmentation is an active field of
research. Most WSI classification methods focus on classifying or
extracting features on patches. In one such approach, a pretrained CNN
model extracts features on patches which are then aggregated for WSI
classification. As shown by our experiments, the heterogeneity of some
cancer subtypes cannot be captured by those generic CNN features.
Patch-level supervised classifiers can learn the heterogeneity of cancer
subtypes if many patch labels are provided. However, acquiring such labels
at large scale has prohibitive cost, due to the need for highly specialized
annotators. As digitization of tissue samples becomes increasingly
commonplace, one can envision large scale datasets that could not be
annotated at patch scale. Utilizing unlabeled patches has led to Multiple
Instance Learning (MIL) based WSI classification [49, 50].

[Figure 2 labels: First-level model training. The pixel intensities are the
predicted probabilities (output of CNN) that the corresponding patches have
the same label as the image.]

Figure 2: An overview of our workflow (best viewed in color). Top: A CNN is
trained on patches. An EM-based method iteratively identifies
non-discriminative patches and eliminates them from the CNN training set.
Bottom: An image-level decision fusion model is trained on histograms of
patch-level predictions, to predict the image-level label.

In the MIL paradigm, unlabeled instances belong to labeled bags of
instances. The goal is to predict the label of a new bag and/or the label
of each instance. The Standard Multi-Instance (SMI) assumption states that
for a binary classification problem, a bag is positive iff there exists at
least one positive instance in the bag. The probability of a bag being
positive equals the maximum positive prediction over all of its instances
[52, 27]. When MIL is combined with Neural Networks (NN) [55], the SMI
assumption is modeled by max-pooling. Following this formulation, the Back
Propagation for Multi-Instance Problems (BP-MIP) [55] performs back
propagation along the instance with the maximum response if the bag is
positive. This is inefficient because only one instance per bag is trained
in one training iteration on the whole bag.

MIL-based CNNs have been applied to object recognition and semantic
segmentation in image analysis, where the image is the bag and
image-windows are the instances [34]. These methods also follow the SMI
assumption. The training error is only propagated through the
object-containing window, which is also assumed to be the window that has
the maximum prediction confidence. This is not robust because one
significantly misclassified window might be considered as the
object-containing window. Additionally, in WSIs, there might be multiple
windows that contain discriminative information. Recent semantic image
segmentation approaches [37] smooth the output probability (feature) maps
of the CNNs. In this way, they can identify relevant windows more robustly.

To predict the image-level label, max-pooling (SMI) and voting
(average-pooling) have been applied in prior work. However, it has been
shown that in many applications, learning decision fusion models can
significantly improve performance compared to voting [40, 43, 24].
Furthermore, such a learned decision fusion model is based on the
Count-based Multiple Instance (CMI) assumption, which is the most general
MIL assumption.
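The difference between the two fixed fusion rules discussed above can be seen in a few lines. The following is a minimal sketch, not taken from the paper: the function names and the patch probabilities are hypothetical and chosen only to show how a single confident patch dominates max-pooling while voting dilutes it.

```python
def smi_probability(patch_probs):
    """Max-pooling (SMI): the image is as positive as its most positive patch."""
    return max(patch_probs)


def vote_probability(patch_probs):
    """Voting (average-pooling): every patch contributes equally."""
    return sum(patch_probs) / len(patch_probs)


# Hypothetical patch-level probabilities for one class in one image.
# Three patches score low; one patch is confidently (perhaps wrongly) positive.
patch_probs = [0.10, 0.20, 0.15, 0.95]

smi = smi_probability(patch_probs)    # driven entirely by the single 0.95 patch
vote = vote_probability(patch_probs)  # pulled down by the three low patches
```

Neither rule looks at how the patch predictions are distributed over classes, which is the information a learned fusion model can exploit.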

Our main contributions in this paper are: 

1. To the best of our knowledge, we are the first to combine patch-level
CNNs with supervised decision fusion. Aggregating patch-level CNN
predictions for WSI classification significantly outperforms patch-level
CNNs with max-pooling or voting.

2. We propose a new EM-based model that identifies discriminative patches
in high resolution images automatically for patch-level CNN training,
utilizing the spatial relationship between patches.

3. Our model achieves multiple state-of-the-art results classifying WSIs to
cancer subtypes on the TCGA dataset. Our results are similar or close to
inter-observer agreement between pathologists. Larger classification
improvements are observed in the harder-to-classify cases.

4. We provide experimental evidence that combining multiple patch-level
classifiers might actually be advantageous compared to whole image
classification.

The rest of this paper is organized as follows. Sec. 2 describes the
framework of the EM-based MIL algorithm. Sec. 3 discusses the
identification of discriminative patches. Sec. 4 explains the image-level
model that predicts the image-level label by aggregating patch-level
predictions. Sec. 5 shows experimental results. The paper concludes in
Sec. 6. App. A lists the cancer subtypes in our experiments.

2. EM-based method with CNN 

An overview of our EM-based method can be found in Fig. 2. We model a high
resolution image as a bag and patches extracted from it as instances. We
have a ground truth label for the whole image but not for the individual
patches. We model whether an instance is discriminative or not as a hidden
binary variable.

We denote $X = \{X_1, X_2, \dots, X_N\}$ as the dataset containing $N$
bags. Each bag $X_i = \{X_{i,1}, X_{i,2}, \dots, X_{i,N_i}\}$ consists of
$N_i$ instances, where $X_{i,j} = \langle x_{i,j}, y_i \rangle$ is the
$j$-th instance and its associated label in the $i$-th bag. Assuming the
bags are independent and identically distributed (i.i.d.), $X$ and the
hidden variables $H$ are generated by the following generative model:

$$P(X, H) = \prod_{i=1}^{N} \Big( P(X_{i,1}, \dots, X_{i,N_i} \mid H_i) \, P(H_i) \Big) \tag{1}$$

where the hidden variable $H = \{H_1, H_2, \dots, H_N\}$,
$H_i = \{H_{i,1}, H_{i,2}, \dots, H_{i,N_i}\}$ and $H_{i,j}$ is the hidden
variable that indicates whether instance $x_{i,j}$ is discriminative for
label $y_i$ of bag $X_i$. We further assume that each $X_{i,j}$ depends on
$H_{i,j}$ only and that the $X_{i,j}$ are independent of each other given
$H_{i,j}$. Thus

$$P(X, H) = \prod_{i=1}^{N} \prod_{j=1}^{N_i} \Big( P(X_{i,j} \mid H_{i,j}) \, P(H_{i,j}) \Big) \tag{2}$$

We maximize the data likelihood $P(X)$ using EM.

1. At the initial E step, we set $H_{i,j} = 1$ for all $i, j$. This means
that all instances are considered discriminative.

2. M step: We update the model parameter $\theta$ to maximize the data
likelihood

$$\theta \leftarrow \arg\max_{\theta} P(X \mid H; \theta) = \arg\max_{\theta} \prod_{x_{i,j} \in D} P(x_{i,j}, y_i \mid \theta) \times \prod_{x_{i,j} \notin D} P(x_{i,j}, y_i \mid \theta) \tag{3}$$

where $D$ is the discriminative patches set. Assuming a uniform generative
model for all non-discriminative instances, the optimization in Eq. 3
simplifies to:

$$\arg\max_{\theta} \prod_{x_{i,j} \in D} P(x_{i,j}, y_i \mid \theta) = \arg\max_{\theta} \prod_{x_{i,j} \in D} P(y_i \mid x_{i,j}; \theta) \, P(x_{i,j} \mid \theta) \tag{4}$$

Additionally we assume a uniform distribution over $x_{i,j}$. Thus Eq. 4
describes a discriminative model (in this paper we use a CNN).

3. E step: We estimate the hidden variables $H$. In particular,
$H_{i,j} = 1$ if and only if $P(H_{i,j} \mid X)$ is above a certain
threshold. In the case of image classification, given the $i$-th image,
$P(H_{i,j} \mid X)$ is obtained by applying Gaussian smoothing on
$P(y_i \mid x_{i,j}; \theta)$ (detailed in Sec. 3). This smoothing step
utilizes the spatial relationship of $P(y_i \mid x_{i,j}; \theta)$ in the
image. We then iterate back to the M step until convergence.
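The three steps above form a loop that alternates training and re-selection. The sketch below shows only that control flow under stated assumptions: `em_select`, `train`, `predict` and `select` are hypothetical names, and the toy stand-ins (a mean as the "model", closeness to the mean as the "probability") replace the CNN and the smoothing-plus-threshold step purely so the loop can run.

```python
def em_select(patch_ids, train, predict, select, max_iters=10):
    """Alternate M steps (train) and E steps (select) until the set stops changing."""
    discriminative = set(patch_ids)      # initial E step: H_ij = 1 for all patches
    model = train(discriminative)        # first M step
    for _ in range(max_iters):
        probs = {p: predict(model, p) for p in patch_ids}
        updated = select(probs)          # E step: re-estimate the hidden variables
        if updated == discriminative:    # converged: the selection is stable
            break
        discriminative = updated
        model = train(discriminative)    # M step on the new discriminative set
    return model, discriminative


# Toy stand-ins (hypothetical): each patch is one number, the "model" is the
# mean of the selected patches, the "probability" is closeness to that mean.
features = {0: 0.90, 1: 0.80, 2: 0.85, 3: 0.10, 4: 0.20}


def train(selected):
    return sum(features[p] for p in selected) / len(selected)


def predict(model, patch_id):
    return 1.0 - abs(features[patch_id] - model)


def select(probs, threshold=0.7):
    return {p for p, value in probs.items() if value >= threshold}


model, selected = em_select(list(features), train, predict, select)
```

In this toy run the two outlying patches (3 and 4) are dropped after the first pass, and patch 0, which was initially rejected, is recovered once the model is retrained on the cleaner set.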

Many MIL algorithms can be interpreted through this formulation. Based on
the SMI assumption, the instance with the maximum $P(H_{i,j} \mid X)$ is
the discriminative instance for the positive bag, as in the EM Diverse
Density (EM-DD) [53] and the BP-MIP algorithms.

3. Discriminative patch selection 

Patches $x_{i,j}$ that have $P(H_{i,j} \mid X)$ larger than a threshold
$T_{i,j}$ are considered discriminative and are selected to continue
training the CNN. We present in this section the estimation of
$P(H \mid X)$ and the choice of the threshold.

It is reasonable to assume that $P(H_{i,j} \mid X)$ is correlated with
$P(y_i \mid x_{i,j}; \theta)$, i.e., patches with lower
$P(y_i \mid x_{i,j}; \theta)$ tend to have a lower probability of being
discriminative. However, a hard-to-classify patch, or a patch close to the
decision boundary, may have low $P(y_i \mid x_{i,j}; \theta)$ as well.
These patches are informative and should not be rejected. Therefore, to
obtain a more robust $P(H_{i,j} \mid X)$, we apply the following two steps.
First, we train two CNNs on two different scales in parallel;
$P(y_i \mid x_{i,j}; \theta)$ is the averaged prediction of the two CNNs.
Second, we denoise the probability map $P(y_i \mid x_{i,j}; \theta)$ of
each image with a Gaussian kernel to compute $P(H_{i,j} \mid X)$.
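The two steps just described (averaging two probability maps, then Gaussian denoising) can be sketched as follows. This is an illustration under assumptions that the text does not fix: the two maps are taken to be aligned on the same patch grid, the kernel width `sigma`, the `radius` and the border handling (replicating edge values) are arbitrary choices, and all function names are hypothetical.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalized 1-D Gaussian weights on the offsets -radius..radius."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_1d(values, kernel):
    """Convolve one row or column, replicating the border values."""
    radius = len(kernel) // 2
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * values[j]
        out.append(acc)
    return out


def smooth_map(prob_map, sigma=1.0, radius=2):
    """Separable Gaussian smoothing: rows first, then columns."""
    kernel = gaussian_kernel(sigma, radius)
    rows = [smooth_1d(row, kernel) for row in prob_map]
    cols = [smooth_1d(list(col), kernel) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def average_maps(map_a, map_b):
    """Element-wise mean of two probability maps of the same shape."""
    return [[(a + b) / 2.0 for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(map_a, map_b)]


# Hypothetical 5 x 5 map with one isolated high-probability patch.
spike = [[0.0] * 5 for _ in range(5)]
spike[2][2] = 1.0
smoothed = smooth_map(average_maps(spike, spike))
```

An isolated spike is damped and spread to its neighbours, so a single misclassified patch has less influence on the selection than a coherent region of high-probability patches.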

Choosing a thresholding scheme carefully yields significantly better
performance than a simpler thresholding scheme. We obtain the threshold
$T_{i,j}$ for $P(H_{i,j} \mid X)$ as follows. We denote $S_i$ as the set of
$P(H_{i,j} \mid X)$ values for all $x_{i,j}$ of the $i$-th image and $E_c$
as the set of $P(H_{i,j} \mid X)$ values for all $x_{i,j}$ of the $c$-th
class. We introduce the image-level threshold $H_i$ as the $P_1$-th
percentile of $S_i$ and the class-level threshold $R_i$ as the $P_2$-th
percentile of $E_c$, where $c$ is the class of the $i$-th image and $P_1$
and $P_2$ are predefined. The threshold $T_{i,j}$ is defined as the minimum
value between $H_i$ and $R_i$. There are two advantages of our method.
First, by using the image-level threshold, at least a $1 - P_1$ fraction of
the patches in each image is considered discriminative. Second, by using
the class-level threshold, the thresholds can be easily adapted to classes
with different prior probabilities.
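The threshold rule reduces to a minimum over two percentiles. The sketch below is an illustration only: it assumes $P_1$ and $P_2$ are given as fractions in [0, 1], uses a linearly interpolated percentile (the text does not say which percentile definition is used), and all names and values are hypothetical.

```python
import math


def percentile(values, q):
    """q-th quantile (q in [0, 1]), interpolating linearly between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1.0 - frac) + ordered[hi] * frac


def patch_threshold(image_values, class_values, p1, p2):
    """T = min(H_i, R_i): image-level and class-level percentile thresholds."""
    h_i = percentile(image_values, p1)   # from the values of this image
    r_i = percentile(class_values, p2)   # from the values of the image's class
    return min(h_i, r_i)


# Hypothetical smoothed probabilities P(H | X).
image_values = [0.2, 0.4, 0.6, 0.8, 1.0]   # patches of one image
class_values = [0.1, 0.3, 0.5, 0.7, 0.9]   # patches of all images in its class
threshold = patch_threshold(image_values, class_values, p1=0.25, p2=0.5)
selected = [v for v in image_values if v > threshold]
```

Taking the minimum means the class-level threshold can only loosen the image-level one, never tighten it, which is what preserves the per-image lower bound on the number of retained patches.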


4. Image-level decision fusion model 

We combine the patch-level classifiers described above to predict the
image-level label. We input all patch-level predictions into a multi-class
logistic regression or SVM that outputs the image-level label. This
decision level fusion method is more robust than max-pooling. Moreover,
this method can be thought of as a Count-based Multiple Instance (CMI)
learning method with two-level learning, which is a more general MIL
assumption than the Standard Multiple Instance (SMI) assumption.

There are three reasons for combining multiple instances. First, on
difficult datasets, we do not want to assign an image-level prediction
simply based on a single patch-level prediction (as is the case under the
SMI assumption). Second, even though certain patches are not discriminative
individually, their joint appearance might be discriminative. For example,
a WSI of the “mixed” glioma, Oligoastrocytoma (see App. A), should be
recognized when two single glioma subtypes (Oligodendroglioma and
Astrocytoma) are jointly present on the slide, possibly on non-overlapping
regions. Third, because the patch-level model is never perfect and probably
biased, an image-level decision fusion model may learn to correct the bias
of patch-level decisions.

In particular, the class histogram of the patch-level predictions is the
input to a linear multi-class logistic regression model or an SVM with
Radial Basis Function (RBF) kernel. To generate the histogram, we simply
sum up all of the class probabilities given by the patch-level CNN.
Moreover, we concatenate histograms from four CNN models: CNNs trained at
two patch scales for two different numbers of iterations. We found in
practice that concatenating multiple histograms is robust.
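The image-level feature described above is a concatenation of summed class probabilities. A minimal sketch follows; the function names and the probabilities are hypothetical, and only two models with two classes are shown instead of the four models used in the experiments.

```python
def class_histogram(patch_probs):
    """Sum the per-class probabilities over all patches of one image."""
    n_classes = len(patch_probs[0])
    return [sum(p[c] for p in patch_probs) for c in range(n_classes)]


def fused_feature(per_model_patch_probs):
    """Concatenate the class histograms produced by several patch-level models."""
    feature = []
    for patch_probs in per_model_patch_probs:
        feature.extend(class_histogram(patch_probs))
    return feature


# Hypothetical outputs of two patch-level models on the same three patches.
model_a = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7]]
model_b = [[0.6, 0.4], [0.5, 0.5], [0.4, 0.6]]
feature = fused_feature([model_a, model_b])   # input to the second-level classifier
```

The resulting vector has one entry per class per model, so its length does not depend on how many patches an image contributes.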

5. Experiments 

We evaluate our method on two Whole Slide Tissue Image (WSI) classification
problems: classification of glioma and Non-Small-Cell Lung Carcinoma
(NSCLC) cases into glioma and NSCLC subtypes. Glioma is a type of brain
cancer that arises from glial cells. It is the most common malignant brain
tumor and the leading cause of cancer-related deaths in people under age
20. NSCLC is the most common lung cancer, which is the leading cause of
cancer-related deaths overall. Classifying glioma and NSCLC into their
respective subtypes and grades is crucial to the study of disease onset and
progression in order to provide targeted therapies. The dataset of WSIs
used in the experiments is composed from the public Cancer Genome Atlas
(TCGA) dataset. It contains detailed clinical information and the
Hematoxylin and Eosin (H&E) stained images of various cancers. The typical
resolution of a WSI in this dataset is 100K by 50K pixels. In the rest of
this section, we first describe the algorithms we tested and then show the
evaluation results on the glioma and NSCLC classification tasks.

[Figure 3 panels: (a) GBM (b) OD (c) OA (d) DA (e) SCC (f) ADC]

Figure 3: Some 20X sample patches of gliomas and Non-Small-Cell Lung
Carcinoma (NSCLC) from the TCGA dataset. Two patches in each column belong
to the same subtype of cancer. Notice the large intra-class heterogeneity.

5.1. Patch extraction and segmentation 

To train the CNN model, patches of size 500 by 500 are extracted from WSIs.
To capture structures at multiple scales, we extract patches from 20X (0.5
microns per pixel) and 5X (2.0 microns per pixel) objective magnifications.
Patches that contain less than 30% tissue sections or have too much blood
are discarded. Around 1000 valid patches per image per scale are extracted.
In most cases the patches are non-overlapping given the resolution of a
WSI. Fig. 3 shows sample patches.

To prevent the CNN from severe overfitting, we perform three kinds of data
augmentation in every iteration. First, a random 400 by 400 sub-patch is
selected from each 500 by 500 patch. Second, the sub-patch is randomly
rotated and mirrored. Third, the amount of Hematoxylin and Eosin stained on
the tissue is randomly adjusted. This is done by decomposing the RGB color
of the tissue into H&E color space [42], followed by multiplying the
magnitude of H and E of every pixel by two i.i.d. Gaussian random variables
with expectation equal to one.
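The third augmentation step can be written out as follows. This is a sketch under assumptions that the text leaves open: it assumes the decomposition is done in optical-density space (the usual Beer-Lambert formulation of color deconvolution), that $s_H$ and $s_E$ are fixed unit stain vectors, that one pair $(a, b)$ is drawn per patch, and that the variance $\sigma^2$ is a free parameter. The symbols $I_0$, $c_H$, $c_E$, $r$, $a$, $b$ and $\sigma$ are introduced here for illustration only.

```latex
\begin{align*}
OD  &= -\log_{10}\!\left(I / I_0\right)
      && \text{per-pixel RGB intensity } I \text{, background intensity } I_0 \\
OD  &= c_H\, s_H + c_E\, s_E + r
      && \text{stain magnitudes } c_H, c_E \text{ and residual } r \\
OD' &= a\, c_H\, s_H + b\, c_E\, s_E + r,
      \quad a, b \sim \mathcal{N}(1, \sigma^2) \text{ i.i.d.}
      && \text{rescaled stain magnitudes} \\
I'  &= I_0 \cdot 10^{-OD'}
      && \text{augmented RGB intensity}
\end{align*}
```

With $a = b = 1$ the patch is returned unchanged, which is why the random factors are centered at one.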

5.2. CNN architecture 

The architecture of our CNN is shown in Tab. 1. We used the CAFFE toolbox
[25] for the CNN implementation. The network was trained on a single NVIDIA
Tesla K40 GPU.

5.3. Experiment setup 

The WSIs of 80% of the patients are randomly selected 
to train the model and the remaining 20% to test. Depending 
on method, training patches are further divided into i) CNN 
and ii) decision fusion model training sets. We separate the 
data twice and average the results. Tested algorithms are: 


Layer        Filter size, stride   Output W x H x N
Input        -                     400 x 400 x 3
Conv         10 x 10, 2            196 x 196 x 80
ReLU+LRN     -                     196 x 196 x 80
Max-pool     6 x 6, 4              49 x 49 x 80
Conv         5 x 5, 1              45 x 45 x 120
ReLU+LRN     -                     45 x 45 x 120
Max-pool     3 x 3, 2              22 x 22 x 120
Conv         3 x 3, 1              20 x 20 x 160
ReLU         -                     20 x 20 x 160
Conv         3 x 3, 1              18 x 18 x 200
ReLU         -                     18 x 18 x 200
Max-pool     3 x 3, 2              9 x 9 x 200
FC           -                     320
ReLU+Drop    -                     320
FC           -                     320
ReLU+Drop    -                     320
FC           -                     Dataset dependent
Softmax      -                     Dataset dependent

Table 1: The architecture of our CNN used in glioma and NSCLC
classification. ReLU+LRN is a sequence of Rectified Linear Units (ReLU)
followed by Local Response Normalization (LRN). Similarly, ReLU+Drop is a
sequence of ReLU followed by dropout. The dropout probability is 0.5.
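The spatial sizes in Table 1 can be checked with the usual output-size formulas. The sketch below assumes no padding and the rounding conventions used by Caffe (floor for convolution, ceiling for pooling); the helper names are hypothetical.

```python
import math


def conv_out(size, kernel, stride):
    """Convolution without padding: floor((size - kernel) / stride) + 1."""
    return (size - kernel) // stride + 1


def pool_out(size, kernel, stride):
    """Pooling without padding, rounding up as Caffe's pooling layer does."""
    return math.ceil((size - kernel) / stride) + 1


# (layer type, filter size, stride) for the spatial layers of Table 1.
layers = [("conv", 10, 2), ("pool", 6, 4), ("conv", 5, 1), ("pool", 3, 2),
          ("conv", 3, 1), ("conv", 3, 1), ("pool", 3, 2)]

size = 400                     # input is 400 x 400 x 3
sizes = []
for kind, kernel, stride in layers:
    size = conv_out(size, kernel, stride) if kind == "conv" else pool_out(size, kernel, stride)
    sizes.append(size)

fc_inputs = size * size * 200  # values fed to the first fully connected layer
```

The ceiling in the pooling layers matters here: with floor rounding the first pooling layer would give 48 rather than the 49 listed in the table.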

1. CNN-Vote: CNN followed by voting (average-pooling). All patches
extracted from a WSI are used to train the patch-level CNN. There is no
second-level model. Instead, the final predicted label of a WSI is voted by
the predictions of all patches.

2. CNN-SMI: CNN followed by max-pooling. Same as CNN-Vote except that the
final predicted label of a WSI equals the predicted label of the patch with
maximum probability over all other patches and classes.

3. CNN-Fea-SVM: We apply feature fusion instead of decision level fusion.
In particular, the outputs of the second fully connected layer of the CNN
on all patches are aggregated by 3-norm pooling [48]. Then an SVM with RBF
kernel is applied to predict the image-level label given the fused
features.

4. EM-CNN-Vote/SMI, EM-CNN-Fea-SVM: EM-based method with CNN-Vote, CNN-SMI,
and CNN-Fea-SVM respectively. The patch-level EM-CNN is trained on
discriminative patches identified by the E-step. Depending on the dataset,
the discriminative threshold $P_1$ for each image ranges from 0.18 to 0.25;
the discriminative threshold $P_2$ for each class ranges from 0.05 to 0.28
(details in Sec. 3). In each M-step, the CNN is trained on all the
discriminative patches for 2 epochs.

5. EM-Finetune-CNN-Vote/SMI: Similar to EM-CNN-Vote/SMI except that instead
of training a CNN from scratch, we fine-tune a pretrained 16-layer CNN
model [44] by training it on discriminative patches.

6. CNN-LR: CNN followed by logistic regression. Same as CNN-Vote except
that we train a second-level multi-class logistic regression to predict the
image-level label. One tenth of the patches in each image is held out from
the CNN to train the second-level multi-class logistic regression.

7. CNN-SVM: CNN followed by SVM with RBF kernel instead of logistic
regression.

8. EM-CNN-LR/SVM: EM-based method with CNN-LR and CNN-SVM respectively.

9. EM-CNN-LR w/o spatial smoothing: No Gaussian smoothing is applied to
estimate $P(H \mid X)$. Other parts are the same as EM-CNN-LR.

10. EM-Finetune-CNN-LR/SVM: Similar to EM-CNN-LR/SVM except that instead of
training a CNN from scratch, we fine-tune a pretrained 16-layer CNN model
[44] by training it on discriminative patches.

11. SMI-CNN-SMI: CNN with max-pooling at both discriminative patch
identification and image-level prediction steps. For the patch-level CNN
training, in each WSI only one patch with the highest confidence is
considered discriminative.

12. NM-LBP: Nuclei Morphological features and rotation invariant Local
Binary Patterns are extracted from all patches. A Bag-of-Words (BoW)
feature is built using k-means, followed by an SVM with RBF kernel. This is
a non-CNN baseline.

13. Pretrained-CNN-Fea-SVM: Similar to CNN-Fea-SVM. But instead of training
a CNN, we use a pretrained 16-layer CNN model [44] to extract features from
patches. Then we select the top 500 features according to accuracy on the
training set.

14. Pretrained-CNN-Bow-SVM: We build a BoW model using k-means on features
extracted by the pretrained CNN, followed by an SVM.
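The feature fusion used by the CNN-Fea-SVM variants aggregates per-patch feature vectors into one image-level vector with a p-norm. The sketch below is one common form of p-norm pooling and is an assumption, not the exact definition from the cited work: in particular, normalizing by the number of patches is a choice made here, and the function name and inputs are hypothetical.

```python
def p_norm_pool(features, p=3.0):
    """Pool per-patch feature vectors dimension by dimension with a p-norm.

    features: list of equal-length feature vectors, one per patch.
    Returns ((1/n) * sum_i |f_i|^p)^(1/p) for every feature dimension.
    """
    n = len(features)
    dim = len(features[0])
    return [(sum(abs(f[d]) ** p for f in features) / n) ** (1.0 / p)
            for d in range(dim)]


# Hypothetical two-dimensional features from two patches.
patch_features = [[1.0, 0.0], [2.0, 0.0]]
pooled = p_norm_pool(patch_features, p=3.0)
```

For non-negative features the result lies between average-pooling (p = 1) and max-pooling (p approaching infinity), so p = 3 weights strong responses more heavily without relying on a single patch.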

5.4. WSI of glioma classification 

There are WSIs of six subtypes of glioma in the TCGA dataset. The numbers
of WSIs and patients in each class are shown in Tab. 2. All classes are
described in App. A.


Gliomas      GBM   OD    OA    DA    AA   AO
# patients   209   100   106   82    29   13
# WSIs       510   206   183   114   36   15

Table 2: The numbers of WSIs and patients in each class from the TCGA
dataset. Class descriptions are in App. A.


Methods                            Acc     mAP
CNN-Vote                           0.710   0.812
CNN-SMI                            0.710   0.822
CNN-Fea-SVM                        0.688   0.790
EM-CNN-Vote                        0.733   0.837
EM-CNN-SMI                         0.719   0.823
EM-CNN-Fea-SVM                     0.686   0.790
EM-Finetune-CNN-Vote               0.719   0.817
EM-Finetune-CNN-SMI                0.638   0.758
CNN-LR                             0.752   0.847
CNN-SVM                            0.697   0.791
EM-CNN-LR                          0.771   0.845
EM-CNN-LR w/o spatial smoothing    0.745   0.832
EM-CNN-SVM                         0.730   0.818
EM-Finetune-CNN-LR                 0.721   0.822
EM-Finetune-CNN-SVM                0.738   0.828
SMI-CNN-SMI                        0.683   0.765
NM-LBP                             0.629   0.734
Pretrained-CNN-Fea-SVM             0.733   0.837
Pretrained-CNN-Bow-SVM             0.667   0.756
Chance                             0.513   0.689

Table 3: Glioma classification results. The proposed EM-CNN-LR method
achieved the best result, close to inter-observer agreement between
pathologists (Sec. 5.4).


The results of our experiments are shown in Tab. 3. The confusion matrix is
given in Tab. 4. An experiment showed that the inter-observer agreement of
two experienced pathologists on a similar dataset* was approximately 70%
and that even after reviewing the cases together, they agreed only around
80% of the time [22]. Therefore, our accuracy of 77% is similar to
inter-observer agreement.

                Predictions
Ground Truth    GBM   OD   OA   DA   AA   AO
GBM             214    0    2    0    1    0
OD                1   47   22    2    0    1
OA                1   18   40    8    3    1
DA                3    9    6   20    0    1
AA                3    2    3    3    4    0
AO                2    2    3    0    0    1

Table 4: Confusion matrix of glioma classification. The nature of
Oligoastrocytoma causes the most confusions. See Sec. 5.4 for details.

In the confusion matrix, we note that the classification accuracy between
GBM and Low-Grade Glioma (LGG) is 97% (chance was 51.3%). A fully
supervised method achieved 85% accuracy using a domain specific algorithm
trained on ten manually labeled patches per class [33]. To the best of our
knowledge, our method is the first to classify five LGG subtypes
automatically, a much more challenging classification task than the
benchmark GBM vs. LGG classification. We achieve 57.1% LGG-subtype
classification accuracy with chance at 36.7%. Notice that most of the
confusions are related to oligoastrocytoma (OA) because it is a mixed
glioma that is challenging even for pathologists to agree on, according to
a neuropathology study: “Oligoastrocytomas contain distinct regions of
oligodendroglial and astrocytic differentiation... The minimal percentage
of each component required for the diagnosis of a mixed glioma has been
debated, resulting in poor inter-observer reproducibility for this group of
neoplasms.”

We compare recognition rates for the OA subtype. The F-score of OA
recognition is 0.426, 0.482, and 0.544 using Pretrained-CNN-Fea-SVM,
CNN-LR, and EM-CNN-LR respectively. We thus see that the improvement of our
proposed method over other methods is larger on the harder-to-classify
classes.
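The figures quoted in this section can be re-derived from the glioma confusion matrix (Table 4). The sketch below reads rows as ground truth and columns as predictions, which is the orientation that reproduces the reported chance levels; the variable names are hypothetical.

```python
CLASSES = ["GBM", "OD", "OA", "DA", "AA", "AO"]
CONFUSION = [  # rows: ground truth, columns: predictions (Table 4)
    [214, 0, 2, 0, 1, 0],
    [1, 47, 22, 2, 0, 1],
    [1, 18, 40, 8, 3, 1],
    [3, 9, 6, 20, 0, 1],
    [3, 2, 3, 3, 4, 0],
    [2, 2, 3, 0, 0, 1],
]

total = sum(sum(row) for row in CONFUSION)
correct = sum(CONFUSION[k][k] for k in range(len(CLASSES)))
accuracy = correct / total

# GBM vs. LGG: any LGG subtype predicted as any LGG subtype counts as correct.
lgg_as_lgg = sum(sum(row[1:]) for row in CONFUSION[1:])
gbm_vs_lgg = (CONFUSION[0][0] + lgg_as_lgg) / total

# LGG-subtype accuracy, restricted to LGG cases that were predicted as LGG.
lgg_correct = sum(CONFUSION[k][k] for k in range(1, len(CLASSES)))
lgg_accuracy = lgg_correct / lgg_as_lgg

# F-score of the OA class.
oa = CLASSES.index("OA")
true_oa = sum(CONFUSION[oa])                    # ground-truth OA cases
pred_oa = sum(row[oa] for row in CONFUSION)     # cases predicted as OA
oa_f_score = 2 * CONFUSION[oa][oa] / (true_oa + pred_oa)
```

Rounded, these give 0.771, 0.97, 0.571 and 0.544, matching the accuracy, the GBM vs. LGG accuracy, the LGG-subtype accuracy and the EM-CNN-LR F-score for OA stated in the text.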

The discriminative patch (region) segmentation results in Fig. 4
demonstrate the quality of our EM-based method.

5.5. WSI of NSCLC classification 

We use three major subtypes of Non-Small-Cell Lung Carcinoma (NSCLC).
Numbers of WSIs and patients in each class are in Tab. 5. All classes are
listed in App. A.

Experimental results are shown in Tab. 6; the confusion matrix is in
Tab. 7. When classifying SCC vs. non-SCC, inter-observer agreement between
pulmonary pathology experts and between community pathologists measured by
Cohen’s kappa is κ = 0.64 and κ = 0.41 respectively. We achieved κ = 0.75.
When classifying ADC vs. non-ADC, the inter-observer agreement between
experts and between community pathologists is κ = 0.69 and κ = 0.46
respectively. We achieved κ = 0.60. Therefore, our results appear close to
inter-observer agreement.*

* Results not directly comparable due to possible dataset differences.

[Figure 4 columns, left to right: WSIs, Pathologist, Max-pooling, EM]

Figure 4: Examples of discriminative patch (region) segmentation (best
viewed in color). Discriminative regions are indicated in red. Diagnostic
or highly discriminative regions are yellow. Non-discriminative regions are
in black. Pathologist: ground truth by a pathologist. Max-pooling: results
by CNN with the SMI assumption (SMI-CNN-SMI). The discriminative patches
are indicated by red arrows. EM: results by our EM-based patch-level CNN
(EM-CNN-Vote/SMI/LR). Notice that max-pooling does not segment enough
discriminative regions.

NSCLCs       SCC   ADC   ADC-mix
# patients   347   291   80
# WSIs       316   250   75

Table 5: The numbers of WSIs and patients in each class from the TCGA
dataset. Class descriptions are in App. A.

The ADC-mix subtype is hard to classify because it contains visual features
of multiple NSCLC subtypes. The Pretrained-CNN-Fea-SVM method achieves an
F-score of 0.412 recognizing ADC-mix cases, whereas our proposed method
EM-Finetune-CNN-SVM achieves 0.472. Consistent with the glioma results, our
method’s performance advantages are more pronounced in the hardest cases.

Methods                    Acc     mAP
CNN-Vote                   0.702   0.838
CNN-SMI                    0.731   0.852
CNN-Fea-SVM                0.637   0.793
EM-CNN-Vote                0.714   0.842
EM-CNN-SMI                 0.731   0.850
EM-CNN-Fea-SVM             0.637   0.791
EM-Finetune-CNN-Vote       0.773   0.877
EM-Finetune-CNN-SMI        0.729   0.853
CNN-LR                     0.727   0.845
CNN-SVM                    0.738   0.856
EM-CNN-LR                  0.743   0.856
EM-CNN-SVM                 0.759   0.869
EM-Finetune-CNN-LR         0.784   0.883
EM-Finetune-CNN-SVM        0.798   0.889
SMI-CNN-SMI                0.531   0.749
Pretrained-CNN-Fea-SVM     0.778   0.879
Pretrained-CNN-Bow-SVM     0.759   0.871
Chance                     0.484   0.715

Table 6: NSCLC classification results. The proposed EM-CNN-SVM and
EM-Finetune-CNN-SVM achieved best results, close to the inter-observer
agreement between pathologists. See Sec. 5.5 for details.

                Predictions
Ground Truth    SCC   ADC   ADC-mix
SCC             199    26     0
ADC              30   155    11
ADC-mix           2    25    17

Table 7: The confusion matrix of NSCLC classification.
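The kappa values reported for our method can be re-derived from the NSCLC confusion matrix (Table 7) by collapsing it to a one-vs-rest 2 x 2 table. The sketch below reads rows as ground truth and columns as predictions, which reproduces the reported chance accuracy; the function name is hypothetical.

```python
NSCLC_CLASSES = ["SCC", "ADC", "ADC-mix"]
NSCLC_CONFUSION = [  # rows: ground truth, columns: predictions (Table 7)
    [199, 26, 0],
    [30, 155, 11],
    [2, 25, 17],
]


def one_vs_rest_kappa(confusion, k):
    """Cohen's kappa for class k against all other classes merged."""
    total = sum(sum(row) for row in confusion)
    tp = confusion[k][k]
    truth_pos = sum(confusion[k])                 # ground-truth cases of class k
    pred_pos = sum(row[k] for row in confusion)   # cases predicted as class k
    tn = total - truth_pos - pred_pos + tp
    observed = (tp + tn) / total
    expected = (truth_pos * pred_pos
                + (total - truth_pos) * (total - pred_pos)) / total ** 2
    return (observed - expected) / (1.0 - expected)


kappa_scc = one_vs_rest_kappa(NSCLC_CONFUSION, NSCLC_CLASSES.index("SCC"))
kappa_adc = one_vs_rest_kappa(NSCLC_CONFUSION, NSCLC_CLASSES.index("ADC"))

# F-score of the ADC-mix class, computed as for OA in the glioma experiment.
mix = NSCLC_CLASSES.index("ADC-mix")
mix_f_score = 2 * NSCLC_CONFUSION[mix][mix] / (
    sum(NSCLC_CONFUSION[mix]) + sum(row[mix] for row in NSCLC_CONFUSION))
```

Rounded to the precision used in the text, these give κ = 0.75 for SCC vs. non-SCC, κ = 0.60 for ADC vs. non-ADC, and an ADC-mix F-score of 0.472.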

5.6. Rail surface defect severity grade classification 

A CNN cannot be applied to gigapixel images directly because of
computational limitations. We argue that even when the images are small
enough for CNNs, our patch-based method compares favorably to an
image-based CNN if discriminative information is encoded in image patch
scale and dispersed throughout the images.

To test our hypothesis, we apply our patch-based method to the task of
classifying the severity grade of rail surface defects. Maintenance of rail
surfaces depends on the severity of surface defects. Automatic defect
grading can obviate the need for laborious examination and grading of rail
surface defects on a regular basis. We used a dataset of 939 rail surface
images with defect severity grades from 0 to 7. Typical image resolution is
1200 by 500, as in Fig. 5.

[Figure 5 panels: (a) Grade 0 (b) Grade 2 (c) Grade 4 (d) Grade 7]

Figure 5: Sample images of rail surfaces. The grade indicates defect
severity. Notice that the defects are in image patch scale and dispersed
throughout the image.

To support our claim, we tested two additional methods.

1. CNN-Image: We apply the CNN on image scale directly. In particular, the
CNN is trained on 400 by 400 regions randomly extracted from images in each
iteration. At test time, we apply the CNN on five regions (top left, top
right, bottom left, bottom right, center) and average the predictions.

2. Pretrained-CNN-ImageFea-SVM: We apply a pretrained 16-layer network [44]
to rail surface images to extract features, and train an SVM on these
features.
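The test-time procedure of the CNN-Image baseline can be sketched as follows. This is an illustration under assumptions: the helper names are hypothetical, the center region is placed with integer division, and the per-region class probabilities are made-up values standing in for the output of the CNN.

```python
def five_crop_boxes(width, height, size=400):
    """(left, top) corners of the four corner regions and the center region."""
    right, bottom = width - size, height - size
    return [(0, 0), (right, 0), (0, bottom), (right, bottom),
            (right // 2, bottom // 2)]


def average_predictions(per_region_probs):
    """Average the class probabilities predicted on each region."""
    n = len(per_region_probs)
    return [sum(column) / n for column in zip(*per_region_probs)]


# Regions for a typical 1200 x 500 rail surface image.
boxes = five_crop_boxes(1200, 500)

# Hypothetical two-class probabilities predicted on the five regions.
region_probs = [[0.7, 0.3], [0.6, 0.4], [0.9, 0.1], [0.8, 0.2], [0.5, 0.5]]
image_probs = average_predictions(region_probs)
```

On a 1200 by 500 image the five 400 by 400 regions overlap heavily in the vertical direction but leave gaps horizontally, so a defect between regions can be missed by this baseline.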

The CNN used in this experiment has a similar architecture to the one
described in Tab. 1, with smaller and fewer filters. The size of patches in
our patch-based methods is 64 by 64. We apply 4-fold cross-validation and
show the averaged results in Tab. 8. Our patch-based methods EM-CNN-SVM and
EM-CNN-Fea-SVM outperform the conventional image-based method CNN-Image.
Moreover, results using CNN features extracted on patches
(Pretrained-CNN-Fea-SVM) are better than results with CNN features
extracted on images (Pretrained-CNN-ImageFea-SVM).

Methods                        Acc     mAP
CNN-Vote                       0.695   0.823
CNN-SMI                        0.700   0.801
CNN-Fea-SVM                    0.822   0.903
EM-CNN-Vote                    0.683   0.817
EM-CNN-SMI                     0.684   0.799
EM-CNN-Fea-SVM                 0.830   0.908
CNN-LR                         0.764   0.867
CNN-SVM                        0.803   0.886
EM-CNN-LR                      0.772   0.871
EM-CNN-SVM                     0.813   0.895
SMI-CNN-SMI                    0.258   0.461
Pretrained-CNN-Fea-SVM         0.808   0.894
CNN-Image                      0.770   0.876
Pretrained-CNN-ImageFea-SVM    0.778   0.878
Chance                         0.228   0.438

Table 8: Rail surface defect severity grade classification results. Our
patch-based methods EM-CNN-SVM and EM-CNN-Fea-SVM outperform the
image-based methods CNN-Image and Pretrained-CNN-ImageFea-SVM
significantly.

6. Conclusions

We presented a patch-based Convolutional Neural Network (CNN) model with a
supervised decision fusion model that is successful in Whole Slide Tissue
Image (WSI) classification. We proposed an Expectation-Maximization (EM)
based method that identifies discriminative patches automatically for CNN
training. With our algorithm, we can classify subtypes of cancers given
WSIs of patients with accuracy similar or close to inter-observer
agreements between pathologists. Furthermore, we experimentally demonstrate
using a comparable non-cancer dataset of smaller images that the
performance of our patch-based CNN compares favorably to that of an
image-based CNN. In future work we will leverage the non-discriminative
patches as part of the data likelihood in the EM formulation instead of
assuming they are uniformly distributed. We will explore ways to optimize
CNN training so that it scales up to the large scale pathology datasets
that are becoming available.

Appendix A. Description of cancer subtypes 

The manual classification of Gliomas and Non-Small-Cell Lung Carcinomas
(NSCLC) into subtypes includes assessment of cell distributions and
characteristics such as shape and texture, and tissue region
characteristics such as existence of necrotic regions.

GBM Glioblastoma, ICD-O 9440/3, WHO grade IV. A Whole Slide Image (WSI) is
classified as GBM iff one patch can be classified as GBM with high
confidence.

OD Oligodendroglioma, ICD-O 9450/3, WHO grade II.

OA Oligoastrocytoma, ICD-O 9382/3, WHO grade II; Anaplastic
oligoastrocytoma, ICD-O 9382/3, WHO grade III. This mixed glioma subtype is
hard to classify even by pathologists [22].

DA Diffuse astrocytoma, ICD-O 9400/3, WHO grade II.

AA Anaplastic astrocytoma, ICD-O 9401/3, WHO grade III.

AO Anaplastic oligodendroglioma, ICD-O 9451/3, WHO grade III.

LGG Low-Grade-Glioma. Includes OD, OA, DA, AA, AO.

SCC Squamous cell carcinoma, ICD-O 8070/3.

ADC Adenocarcinoma, ICD-O 8140/3.

ADC-mix ADC with mixed subtypes, ICD-O 8255/3.